Instructions

R markdown is a plain-text file format for integrating text and R code, and creating transparent, reproducible and interactive reports. An R markdown file (.Rmd) contains metadata, markdown and R code “chunks”, and can be “knit” into numerous output types. Answer the test questions by adding R code to the fenced code areas below each item. Some questions also require a written answer. Enter your comments in the space provided as shown below:

Answer: (Enter your answer here.)

Once completed, you will “knit” and submit the resulting .html document and the .Rmd file. The .html will present the output of your R code and your written answers, but your R code will not appear. Your R code will appear in the .Rmd file. The resulting .html document will be graded and a feedback report returned with comments. Points assigned to each item appear in the template.

Before proceeding, look to the top of the .Rmd for the (YAML) metadata block, where the title, author and output are given. Please change author to include your name, with the format ‘lastName, firstName.’

If you encounter issues with knitting the .html, please send an email via Canvas to your TA.

Each code chunk is delineated by six (6) backticks; three (3) at the start and three (3) at the end. After the opening ticks, arguments are passed to the code chunk in curly brackets. Please do not add or remove backticks, or modify the arguments or values inside the curly brackets. An example code chunk is included here:

# Comments are included in each code chunk, simply as prompts

#...R code placed here

#...R code placed here

R code only needs to be added inside the code chunks for each assignment item. However, there are questions that follow many assignment items. Enter your answers in the space provided. An example showing how to use the template and respond to a question follows.


Example Problem with Solution:

Use rbinom() to generate two random samples of size 10,000 from the binomial distribution. For the first sample, use p = 0.45 and n = 10. For the second sample, use p = 0.55 and n = 10. Convert the sample frequencies to sample proportions and compute the mean number of successes for each sample. Present these statistics.

set.seed(123)
sample.one <- table(rbinom(10000, 10, 0.45)) / 10000
sample.two <- table(rbinom(10000, 10, 0.55)) / 10000

successes <- seq(0, 10)

round(sum(sample.one*successes), digits = 1) # [1] 4.5
## [1] 4.5
round(sum(sample.two*successes), digits = 1) # [1] 5.5
## [1] 5.5

Question: How do the simulated expectations compare to calculated binomial expectations?

Answer: The calculated binomial expectations are 10(0.45) = 4.5 and 10(0.55) = 5.5. After rounding the simulated results, the same values are obtained.


Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.


##Data Analysis #2

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

Test items start here. There are 10 sections, for a total of 75 points.

#### Section 1: (5 points) ####

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] 0.7147056
## [1] 4.667298

(1)(b) Transform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## [1] -0.09391548
## [1] 3.535431
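The effect of the log transformation on skewness and kurtosis can be sketched in base R. This is a minimal illustration, not the assignment's solution: the `skew()` and `kurt()` helpers are hypothetical moment-based definitions (not the ‘rockchalk’ implementations), and `ratio` is simulated data standing in for RATIO.

```r
# Moment-based skewness and (non-excess) kurtosis.
# Illustrative helpers only; these are not the rockchalk functions.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(1)
ratio <- rlnorm(1000, meanlog = -2, sdlog = 0.3)  # hypothetical stand-in for RATIO
l_ratio <- log10(ratio)

c(skew(ratio), kurt(ratio))      # right-skewed on the raw scale
c(skew(l_ratio), kurt(l_ratio))  # much closer to the normal reference values of 0 and 3

# Histogram and QQ plot of the transformed variable, side by side
par(mfrow = c(1, 2))
hist(l_ratio, main = "Histogram of L_RATIO", xlab = "L_RATIO")
qqnorm(l_ratio)
qqline(l_ratio)
```

With this definition a normal sample gives a kurtosis near 3; a function that reports excess kurtosis would give a value near 0 for the same data.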

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$RATIO and mydata$CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$L_RATIO and mydata$CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
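A comparison like the one above can be reproduced on simulated data. The sketch below assumes five hypothetical groups whose spread grows with the group mean on the raw scale but is constant on the log scale; `cls` and `raw` are illustrative names, not variables from the abalone data.

```r
set.seed(2)
# Hypothetical data: five groups, spread proportional to the group level on the raw scale
cls <- factor(rep(paste0("A", 1:5), each = 200))
raw <- rlnorm(1000, meanlog = rep(seq(0, 2, by = 0.5), each = 200), sdlog = 0.25)

bt_raw <- bartlett.test(raw ~ cls)         # raw scale
bt_log <- bartlett.test(log10(raw) ~ cls)  # log10 scale

bt_raw$p.value  # very small: variances differ across groups on the raw scale
bt_log$p.value  # population variances are equal on the log scale by construction
```

A small p-value rejects the null hypothesis of equal variances; a large one only means the test found no evidence against it.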

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

Answer: RATIO is not normally distributed. The histogram shows that the data are right-skewed, and the QQ plot shows deviations from the reference line. The numbers agree: the skewness (0.71) is positive and the kurtosis (4.67) is greater than 3, so we conclude that the distribution is not normal.

Taking the logarithm of RATIO is an attempt to normalize the variable: it pulls in extreme values while preserving their relative order. After the transformation, both the histogram and the QQ plot show a distribution closer to normal than before. The skewness (-0.09) is closer to 0, and the kurtosis (3.54) is closer to 3, though not exactly 3. The Bartlett test also favors L_RATIO: with p = 0.5267 we cannot reject the null hypothesis that the variance of L_RATIO is homogeneous across the CLASS levels, whereas for the untransformed RATIO (p = 0.0002531) that null hypothesis is rejected.

#### Section 2 (10 points) ####

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

Answer: The interaction term is not significant (p = 0.867). Adding it barely changes the residual sum of squares (7.047 versus 7.021) and does not reduce the residual mean square (0.00685 versus 0.00688). This suggests that the effect of CLASS on the expected value of L_RATIO does not depend on SEX, and vice versa.
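The comparison of the two models can also be made with a single F test on the nested fits. The following is a sketch on simulated data, assuming a hypothetical response `y` with additive CLASS and SEX effects and no interaction; it is not the code used to produce the tables above.

```r
set.seed(3)
n <- 600
d <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Hypothetical response: additive effects only, so the true interaction is zero
d$y <- -0.02 * as.integer(d$CLASS) + 0.01 * (d$SEX == "M") + rnorm(n, sd = 0.08)

fit_add <- aov(y ~ CLASS + SEX, data = d)  # without the interaction
fit_int <- aov(y ~ CLASS * SEX, data = d)  # with CLASS:SEX

summary(fit_add)
summary(fit_int)

# F test of the nested models: does CLASS:SEX explain additional variation?
cmp <- anova(fit_add, fit_int)
cmp
```

The interaction uses (5 - 1) x (3 - 1) = 8 degrees of freedom, which is the difference in residual degrees of freedom between the two fits.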

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fm1, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236690 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

Answer: Every difference is negative and grows in magnitude as the classes move further apart (for example, A2-A1 = -0.012, A3-A1 = -0.034, A4-A1 = -0.059 and A5-A1 = -0.100), so the expected value of L_RATIO declines steadily with age class. The A2-A1 difference is not statistically significant (p = 0.69), while every other pair of classes differs significantly at the 95% level. This suggests that the expected value of L_RATIO for abalones in classes A3, A4 and A5 could be materially different from that of abalones in classes A1 and A2. Among the older classes, the A5-A3 difference (-0.066) is larger than the A4-A3 difference (-0.024), although both are significant. For SEX, the difference between males and females is not statistically significant (p = 0.94), while infants differ significantly from both. Therefore, we can consider merging males and females into one level labeled “adults”.
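Reading a TukeyHSD() table comes down to checking whether each interval excludes zero. The sketch below uses simulated data in which only infants differ from the other two sexes; the data frame `d` and response `y` are hypothetical.

```r
set.seed(4)
n <- 600
d <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Hypothetical response: infants are shifted down, males and females are alike
d$y <- -0.02 * as.integer(d$CLASS) - 0.03 * (d$SEX == "I") + rnorm(n, sd = 0.08)

tk <- TukeyHSD(aov(y ~ CLASS + SEX, data = d), conf.level = 0.95)
sex_tab <- tk$SEX
sex_tab

# A pair differs at the family-wise 5% level when its interval excludes zero
sex_tab[, "lwr"] > 0 | sex_tab[, "upr"] < 0
```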

#### Section 3: (10 points) ####

(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.

## 
##     I ADULT 
##   329   707

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

Answer: The distribution for the adults looks roughly normal, whereas the distribution for the infants is skewed to the right, with most of the data concentrated below 250. The two distributions overlap, so it will be difficult to separate infants from adults based on VOLUME alone.

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

Answer: The log base 10 transformation reduces the variability present when comparing SHUCK and VOLUME. We also see fewer extreme values, since the log transformation pulls in extreme values while preserving their relative order. That said, most of the data points are now clustered in the top right of the graph rather than the bottom left: whereas the data may have been right-skewed before the transformation, afterwards they appear to skew towards the left. As for the implications for linear regression, the linear relationship between SHUCK and VOLUME still holds. However, if we model the log of SHUCK on the log of VOLUME, we have to take the transformation into account when interpreting the coefficients of the model.

Observing SHUCK and VOLUME by TYPE, we can see that infants have much lower VOLUME and SHUCK than adults. Observing SHUCK and VOLUME by CLASS, we can see that A1 has lower SHUCK and VOLUME than the other classes. Classes A3, A4 and A5 are clustered together quite heavily; within that cluster, A5 has lower SHUCK weight than A3 and A4, and A3 has higher SHUCK weight than the other two.

#### Section 4: (5 points) ####

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.

## 
##     I ADULT 
##   289   747

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## 
## Call:
## lm(formula = fm3, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.817512   0.019040 -42.936  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEADULT    0.021093   0.007688   2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

Essay Question: Interpret the trend in CLASS level coefficient estimates. (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).

Answer: The coefficients for CLASS are negative and become more negative as we progress from A2 to A5 (-0.018, -0.047, -0.076, -0.117). With A1 as the baseline and all else held constant, this shows that as abalones age they tend to have lower SHUCK weight (here, the log of SHUCK weight) for a given VOLUME. This interpretation is consistent with the earlier displays of SHUCK by CLASS.

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

Answer: The baseline for TYPE is infant, so the coefficient compares the effect on SHUCK weight when TYPE is ADULT to when TYPE is infant. Because the response is log10(SHUCK), the coefficient of 0.021 corresponds to a factor of 10^0.021 ≈ 1.05, so adults are predicted to have about 5% more SHUCK weight than infants of the same VOLUME and CLASS. That increase is not large, especially compared with some of the CLASS effects. TYPE does contribute to predictions of SHUCK weight, but its coefficient is comparable in magnitude only to that of CLASS A2 (-0.018).
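Because the response in this model is log10(SHUCK), a coefficient b acts on SHUCK as a multiplicative factor of 10^b. The short sketch below converts two of the estimates from the summary output into percent changes; the variable names are illustrative.

```r
# Coefficients copied from the regression summary (log10 scale)
b_type <- 0.021093    # TYPEADULT
b_a5   <- -0.117119   # CLASSA5

# Convert a log10-scale coefficient to a percent change in SHUCK
pct_change <- function(b) 100 * (10^b - 1)

pct_type <- pct_change(b_type)  # adults versus infants, all else equal
pct_a5   <- pct_change(b_a5)    # class A5 versus A1, all else equal

round(c(TYPEADULT = pct_type, CLASSA5 = pct_a5), 1)
```

Reading a log10 coefficient directly as a percentage understates the effect, since 10^b - 1 is roughly 2.3 times b for small b.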


The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).


#### Section 5: (5 points) ####

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] -0.05945234
## [1] 3.343308

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE. (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange().) Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$Model_Residuals and mydata$CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

Answer: The displays and calculations in (5)(a) suggest that the model residuals are approximately normally distributed (skewness -0.06, kurtosis 3.34). The displays and calculations in (5)(b) suggest that the residual variance is homogeneous across types, classes and values of the log of volume; the Bartlett test does not reject equal variances across classes (p = 0.4498). Given this, VOLUME can be useful for harvesting decisions as a stand-in for SHUCK. VOLUME is described as a stand-in because the residuals come from a model whose predicted and actual values are the log of SHUCK (and consequently SHUCK).


Harvest Strategy:

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible. Management needs to decide on one rule that meets the business goal.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.


#### Section 6: (5 points) ####

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.
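The provided code is not reproduced in this report. As a rough sketch of the kind of loop described, the example below uses the names mentioned in the instructions (min.v, delta, volume.value, prop.infants, prop.adults) on simulated volumes. The loop body and the vectors `vol_inf` and `vol_adu` are assumptions, not the supplied code.

```r
set.seed(5)
# Hypothetical volumes standing in for mydata$VOLUME split by TYPE
vol_inf <- rgamma(300, shape = 2, scale = 80)
vol_adu <- rgamma(700, shape = 6, scale = 65)
all_vol <- c(vol_inf, vol_adu)

min.v <- min(all_vol)
delta <- (max(all_vol) - min.v) / 10000

volume.value <- numeric(10000)
prop.infants <- numeric(10000)
prop.adults  <- numeric(10000)

for (k in 1:10000) {
  value <- min.v + k * delta
  volume.value[k] <- value
  # Proportion NOT harvested: individuals at or below the cutoff
  prop.infants[k] <- mean(vol_inf <= value)
  prop.adults[k]  <- mean(vol_adu <= value)
}

head(cbind(volume.value, prop.infants, prop.adults))
```

Using the same min.v, delta and loop range for both groups keeps the two proportion vectors aligned element by element, which is what allows them to be plotted and subtracted against a common volume.value.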

(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.

## [1] 526.6383
## [1] 0.2476573
## [1] 0

(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.

##        I 
## 133.8214
## [1] 0.4982699
## [1] 0.9330656
##    ADULT 
## 384.5584
## [1] 0.02422145
## [1] 0.4993307
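The three rules so far all reduce to counting the individuals whose volume exceeds a cutoff. A minimal sketch on a small, made-up set of volumes (the vectors and the `harvest()` helper are hypothetical):

```r
# Hypothetical volumes; in the assignment these come from mydata$VOLUME split by TYPE
vol_inf <- c(40, 95, 130, 210, 300)
vol_adu <- c(120, 250, 310, 420, 560, 610)

# Proportion of each group with volume strictly greater than the cutoff
harvest <- function(cutoff) {
  c(infants = mean(vol_inf > cutoff), adults = mean(vol_adu > cutoff))
}

harvest(max(vol_inf))     # protect all infants: no infant exceeds the largest infant
harvest(median(vol_inf))  # median infant volume as the cutoff
harvest(median(vol_adu))  # median adult volume as the cutoff
```

Because the comparison is strict, using the largest infant volume as the cutoff always gives an infant harvest proportion of zero.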

(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).

Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?

Answer: If we used the median infant volume as a cutoff for harvesting, we would harvest almost 50% of the infants but 93% of the adults. If we used the median adult volume as a cutoff, we would harvest 50% of the adults but only about 2.4% of the infants. Visually this makes sense as well: the red line shows only a few jumps beyond the median adult volume, so few infant data points lie there. The challenge is presented by abalones with volumes above the median infant volume but below the median adult volume. Infants in that range would be difficult to detect and could easily pass as adults to be harvested. The median adult volume, which harvests only 2.4% of infants, is therefore a strong contender for our cutoff. However, since that harvest would yield only 50% of the adults, the harvesters would either have to adjust expectations around yield or find additional ways to determine whether an abalone is an infant or an adult.


More harvest strategies:

This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.

Note on ROC:

There are multiple packages that have been developed to create ROC curves. However, these packages - and the functions they define - expect to see predicted and observed classification vectors. Then, from those predictions, those functions calculate the true positive rates (TPR) and false positive rates (FPR) and other classification performance metrics. These packages are worthwhile, and you will certainly encounter them if you work in R on classification problems. However, in this case, we already have vectors with the TPRs and FPRs. Our adult harvest proportion vector, (1 - prop.adults), is our TPR. This is the proportion, at each possible ‘rule,’ at each hypothetical harvest threshold (i.e. element of volume.value), of individuals we will correctly identify as adults and harvest. Our FPR is the infant harvest proportion vector, (1 - prop.infants). At each possible harvest threshold, what is the proportion of infants we will mistakenly harvest? Our ROC curve, then, is created by plotting (1 - prop.adults) as a function of (1 - prop.infants). In short, how much more ‘right’ can we be (moving upward on the y-axis) if we’re willing to be increasingly wrong, i.e. harvest some proportion of infants (moving right on the x-axis)?


#### Section 7: (10 points) ####

(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(c). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.

(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.

Include, side-by-side, the plot from (6)(d) but with a fourth vertical A-B line added. That line should intercept the x-axis at the “max difference” volume determined from the smoothed curve here.

(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions. We will actually calculate these proportions in two ways: first, by ‘indexing’ and returning the appropriate element of the (1 - prop.adults) and (1 - prop.infants) vectors, and second, by simply counting the number of adults and infants with VOLUME greater than the volume threshold of interest.

Code for calculating the adult harvest proportion using both approaches is provided.

## [1] 0.7416332
## [1] 0.1764706
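The smoothing and peak-finding steps in (7)(b) through (7)(d) can be sketched as follows. The supplied smoothing code is not shown in this report, so the choice of loess(), the span of 0.25, the 1,000-point grid and the simulated volumes are all assumptions made for illustration.

```r
set.seed(7)
vol_inf <- rgamma(300, shape = 2, scale = 80)  # hypothetical infant volumes
vol_adu <- rgamma(700, shape = 6, scale = 65)  # hypothetical adult volumes

volume.value <- seq(min(vol_inf, vol_adu), max(vol_inf, vol_adu), length.out = 1000)
prop.infants <- ecdf(vol_inf)(volume.value)    # proportion not harvested
prop.adults  <- ecdf(vol_adu)(volume.value)

harvest.adults  <- 1 - prop.adults
harvest.infants <- 1 - prop.infants
difference <- harvest.adults - harvest.infants

# Smooth each harvest curve separately, then take the difference
smooth.adults  <- loess(harvest.adults ~ volume.value, span = 0.25, family = "symmetric")
smooth.infants <- loess(harvest.infants ~ volume.value, span = 0.25, family = "symmetric")
smooth.difference <- predict(smooth.adults) - predict(smooth.infants)

k <- which.max(smooth.difference)  # index of the smoothed peak
cutoff <- volume.value[k]

plot(volume.value, difference, type = "l",
     xlab = "Volume", ylab = "Difference in harvest proportions")
lines(volume.value, smooth.difference, lty = 2)
abline(v = cutoff, lty = 3)

# Harvest proportions at the cutoff: by indexing, then by counting
c(harvest.adults[k],  mean(vol_adu > cutoff))
c(harvest.infants[k], mean(vol_inf > cutoff))
```

The two approaches agree because element k of each harvest vector is, by construction, the share of that group with volume above volume.value[k].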

There are alternative ways to determine cutoffs. Two such cutoffs are described below.


#### Section 8: (10 points) ####

(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”

Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions. You may use either the ‘indexing’ or ‘count’ approach, or both.

## [1] 0.2871972
## [1] 0.8259705

(8)(b) Next, append one (1) more vertical A-B line to our (6)(d) graph. This line should show the “zero A1 infants” cutoff from (8)(a). The graph should now have five (5) A-B lines: “protect all infants,” “median infant,” “median adult,” “max difference” and “zero A1 infants.”

#### Section 9: (5 points) ####

(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (6), (7) and (8) on this plot and label each.

(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

## [1] 0.8666894
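If the “flux” package is not available, the same area can be approximated in base R with the trapezoid rule. The sketch below is an alternative to auc(), not the method used for the value above, and it runs on simulated volumes; `vol_inf`, `vol_adu` and the 1,000-point grid are assumptions.

```r
set.seed(9)
vol_inf <- rgamma(300, shape = 2, scale = 80)  # hypothetical infant volumes
vol_adu <- rgamma(700, shape = 6, scale = 65)  # hypothetical adult volumes
volume.value <- seq(min(vol_inf, vol_adu), max(vol_inf, vol_adu), length.out = 1000)

fpr <- 1 - ecdf(vol_inf)(volume.value)  # infants harvested at each cutoff
tpr <- 1 - ecdf(vol_adu)(volume.value)  # adults harvested at each cutoff

# Order the points along the curve and add the (0,0) and (1,1) endpoints
o <- order(fpr, tpr)
x <- c(0, fpr[o], 1)
y <- c(0, tpr[o], 1)

# Trapezoid rule: width times average height of each segment
auc_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

plot(x, y, type = "l", xlab = "1 - prop.infants (FPR)", ylab = "1 - prop.adults (TPR)")
abline(0, 1, lty = 2)  # reference line for no discrimination
auc_trap
```

As a cross-check, the area under an ROC curve equals the probability that a randomly chosen adult has a larger volume than a randomly chosen infant, which can be computed directly as mean(outer(vol_adu, vol_inf, ">")).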

#### Section 10: (10 points) ####

(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1-prop.adults), 2) false positive rate (1-prop.infants), 3) harvest proportion of the total population.

To calculate the total harvest proportions, you can use the ‘count’ approach, but ignoring TYPE; simply count the number of individuals (i.e. rows) with VOLUME greater than a given threshold and divide by the total number of individuals in our dataset.

##                    Volume  TPR  FPR totalHarvest
## Max Difference     262.14 0.74 0.18         0.58
## Zero A1 Inf        206.79 0.83 0.29         0.68
## Median Adult       384.56 0.50 0.02         0.37
## Median Infant      133.82 0.93 0.50         0.81
## Protect All Infant 526.64 0.25 0.00         0.18
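A table of this shape can be assembled with the ‘count’ approach alone. The sketch below uses a small, made-up set of volumes and only three of the five rules, so the numbers do not correspond to the table above; all object names are hypothetical.

```r
# Hypothetical volumes; in the assignment these come from mydata$VOLUME split by TYPE
vol_inf <- c(40, 95, 130, 210, 300)
vol_adu <- c(120, 250, 310, 420, 560, 610)

cutoffs <- c(
  "Median Infant"       = median(vol_inf),
  "Median Adult"        = median(vol_adu),
  "Protect All Infants" = max(vol_inf)
)

# One row per cutoff: adults harvested, infants harvested, everyone harvested
rates <- t(sapply(cutoffs, function(v) c(
  Volume       = v,
  TPR          = mean(vol_adu > v),
  FPR          = mean(vol_inf > v),
  totalHarvest = mean(c(vol_inf, vol_adu) > v)
)))

round(rates, 2)
```

The total harvest proportion ignores TYPE: it pools both groups and counts every individual above the cutoff.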

Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the five cutoffs determined in this assignment.

Answer: The Protect All Infants cutoff has the lowest false positive rate (0%) but also the lowest total harvest (18%). This matches the earlier result that this cutoff harvests 0% of infants and 25% of adults. The Median Infant cutoff harvests 93% of adults and close to 50% of infants, giving the highest total harvest (81%). The Median Adult cutoff harvests 50% of adults and only 2% of infants, giving a lower total harvest (37%). The Max Difference and Zero A1 Infants cutoffs both offer good harvest rates (58% and 68% of the population) while keeping false positives (infants harvested) relatively low, at 18% and 29%.

Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators. How would you do so? Consider the following in your answer:

  1. Would you make a specific recommendation or outline various choices and tradeoffs?
  2. What qualifications or limitations would you present regarding your analysis?
  3. If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff?
  4. What suggestions would you have for planning future abalone studies of this type?

Answer: I would recommend the Max Difference cutoff, since it combines a strong true positive rate with a low false positive rate while still giving a high harvest yield. The low false positive rate would also help sustain the abalone population. However, if the species is heading towards extinction, the cutoff should be moved to Protect All Infants. For future studies, more data should be gathered to help detect differences between infant and adult abalones, since the current data set was fairly restricted. It would also be helpful to gather data from harvest farms on the different coasts where abalones are found, to better understand the species.